Under review
OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training
Li, Hongpei, Zhang, Han, Liu, Huikang, Ge, Dongdong, Ye, Yinyu
Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. However, despite recent progress in reducing memory consumption through activation offloading, existing approaches remain largely heuristic and coarse-grained, often overlooking the fine-grained trade-offs between memory, computation, and scheduling latency. In this work, we revisit the pipeline scheduling problem from a principled optimization perspective. We observe that prevailing strategies either rely on static rules or aggressively offload activations without fully leveraging the interaction between memory constraints and scheduling efficiency. To address this, we formulate scheduling as a constrained optimization problem that jointly accounts for memory capacity, activation reuse, and pipeline bubble minimization. Solving this model yields fine-grained schedules that reduce pipeline bubbles while adhering to strict memory budgets. Our approach complements existing offloading techniques: whereas prior approaches trade memory for time in a fixed pattern, we dynamically optimize the tradeoff with respect to model structure and hardware configuration. Experimental results demonstrate that our method consistently improves both throughput and memory utilization. In particular, we reduce idle pipeline time by up to 50% under the same per-device memory limit, and in some cases, enable the training of larger models within limited memory budgets.
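For context on the bubble metric, the following is a minimal sketch of the idle-time fraction of a textbook synchronous pipeline schedule. It assumes every micro-batch takes the same time on every stage and ignores communication, which is the usual idealization for GPipe-style and 1F1B schedules; it is not the optimization model proposed in the paper. The function name `bubble_fraction` is hypothetical.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of an idealized synchronous pipeline schedule.

    Assumption: uniform per-stage compute time and no communication cost.
    Each device is busy for `num_microbatches` time slots out of
    `num_microbatches + num_stages - 1` slots in total.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # Print the idealized bubble fraction for a 4-stage pipeline.
    for m in (4, 8, 32):
        print(f"stages=4, microbatches={m}: {bubble_fraction(4, m):.3f}")
```

Under this idealization, 4 stages with 8 micro-batches leave 3/11 of each device's time idle. Raising the micro-batch count shrinks the bubble, but in a GPipe-style schedule it also increases the number of activations held in memory at once, which is the kind of memory-versus-latency trade-off the abstract describes.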
Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models
Although large language models (LLMs) have transformed AI, they still make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources, a limitation we term the Self-Correction Blind Spot. To study it, we introduce Self-Correction Bench, an evaluation framework that measures the blind spot through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average 64.5% blind spot rate. We provide multiple lines of evidence suggesting this limitation may be influenced by training data: human demonstrations rarely include error-correction sequences (favoring error-free responses), whereas reinforcement learning (RL) trained models learn error correction via outcome feedback. Remarkably, appending a minimal "Wait" prompt yields an 89.3% reduction in blind spots, suggesting dormant capabilities that require triggering. Our work highlights a critical limitation potentially influenced by training distribution and offers a practical approach to enhancing LLM reliability and trustworthiness, which is vital for safety-critical domains.
FMIP: Joint Continuous-Integer Flow For Mixed-Integer Linear Programming
Li, Hongpei, Yuan, Hui, Zhang, Han, Lin, Jianghao, Ge, Dongdong, Wang, Mengdi, Ye, Yinyu
Mixed-Integer Linear Programming (MILP) is a foundational tool for complex decision-making problems. However, the NP-hard nature of MILP presents a significant computational challenge, motivating the development of machine learning-based heuristic solutions to accelerate downstream solvers. While recent generative models have shown promise in learning powerful heuristics, they suffer from a critical limitation. That is, they model the distribution of only the integer variables and fail to capture the intricate coupling between integer and continuous variables, creating an information bottleneck and ultimately leading to suboptimal solutions. To this end, we propose Joint Continuous-Integer Flow for Mixed-Integer Linear Programming (FMIP), which is the first generative framework that models the joint distribution of both integer and continuous variables for MILP solutions. Built upon the joint modeling paradigm, a holistic guidance mechanism is designed to steer the generative trajectory, actively refining solutions toward optimality and feasibility during the inference process. Extensive experiments on eight standard MILP benchmarks demonstrate the superior performance of FMIP against existing baselines, reducing the primal gap by 41.34% on average. Moreover, we show that FMIP is fully compatible with arbitrary backbone networks and various downstream solvers, making it well-suited for a broad range of real-world MILP applications.
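The abstract reports results in terms of the primal gap without defining it. As a rough illustration, the sketch below implements one definition commonly used in the MILP heuristics literature; whether the paper normalizes in exactly this way is an assumption, and the function name `primal_gap` is hypothetical.

```python
def primal_gap(objective: float, best_known: float) -> float:
    """Relative distance between a solution's objective and the best known one.

    Assumed convention (common in the MILP literature, not confirmed by
    the abstract): 0 when the values coincide, 1 when they have opposite
    signs, and |obj - best| / max(|obj|, |best|) otherwise.
    """
    if objective == best_known:
        return 0.0
    if objective * best_known < 0:
        return 1.0
    return abs(objective - best_known) / max(abs(objective), abs(best_known))


if __name__ == "__main__":
    # A heuristic solution with objective 110 against a best known value of 100.
    print(primal_gap(110.0, 100.0))
```

With this convention the gap always lies in [0, 1], so averages over benchmark instances are not dominated by instances with large objective values.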
BenLOC: A Benchmark for Learning to Configure MIP Optimizers
Li, Hongpei, He, Ziyan, Wang, Yufei, Tu, Wenting, Pu, Shanwen, Deng, Qi, Ge, Dongdong
The automatic configuration of Mixed-Integer Programming (MIP) optimizers has become increasingly critical as the large number of configurations can significantly affect solver performance. Yet the lack of standardized evaluation frameworks has led to data leakage and over-optimistic claims, as prior studies often rely on homogeneous datasets and inconsistent experimental setups. To promote a fair evaluation process, we present BenLOC, a comprehensive benchmark and open-source toolkit, which not only offers an end-to-end pipeline for learning instance-wise MIP optimizer configurations, but also standardizes dataset selection, train-test splits, feature engineering and baseline choice for unbiased and comprehensive evaluations. Leveraging this framework, we conduct an empirical analysis on five well-established MIP datasets and compare classical machine learning models with handcrafted features against state-of-the-art deep-learning techniques. The results demonstrate the importance of datasets, features and baseline criteria proposed by BenLOC and the effectiveness of BenLOC in providing unbiased and comprehensive evaluations.
Are GNNs doomed by the topology of their input graph?
Aboussalah, Amine Mohamed, Ed-dib, Abdessalam
Graph Neural Networks (GNNs) have demonstrated remarkable success in learning from graph-structured data. However, the influence of the input graph's topology on GNN behavior remains poorly understood. In this work, we explore whether GNNs are inherently limited by the structure of their input graphs, focusing on how local topological features interact with the message-passing scheme to produce global phenomena such as oversmoothing or expressive representations. We introduce the concept of $k$-hop similarity and investigate whether locally similar neighborhoods lead to consistent node representations. This interaction can result in either effective learning or inevitable oversmoothing, depending on the inherent properties of the graph. Our empirical experiments validate these insights, highlighting the practical implications of graph topology on GNN performance.
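The oversmoothing effect mentioned above can be illustrated with a small sketch. It assumes the simplest possible message-passing scheme, mean aggregation over neighbors with self-loops and no learned weights or nonlinearities, so it is a toy model rather than the setting analyzed in the paper. The helper names `propagate` and `feature_spread` are hypothetical, and the sketch does not implement $k$-hop similarity, which the abstract does not define.

```python
import numpy as np


def propagate(adj: np.ndarray, features: np.ndarray, num_layers: int) -> np.ndarray:
    """Apply `num_layers` rounds of mean aggregation with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    p = a_hat / a_hat.sum(axis=1, keepdims=True)    # row-normalize
    h = features.astype(float)
    for _ in range(num_layers):
        h = p @ h
    return h


def feature_spread(h: np.ndarray) -> float:
    """Largest per-feature difference between any two nodes."""
    return float(np.max(h.max(axis=0) - h.min(axis=0)))


# Path graph on 4 nodes: 0 - 1 - 2 - 3, with one-hot node features.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
x = np.eye(4)

if __name__ == "__main__":
    for depth in (1, 5, 50):
        print(depth, feature_spread(propagate(adj, x, depth)))
```

On a connected graph the spread shrinks toward zero as depth grows, meaning all node representations become indistinguishable. How quickly this happens depends on the graph structure, which is the dependence on topology that the abstract investigates.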
Solving Integrated Process Planning and Scheduling Problem via Graph Neural Network Based Deep Reinforcement Learning
Li, Hongpei, Zhang, Han, He, Ziyan, Jia, Yunkai, Jiang, Bo, Huang, Xiang, Ge, Dongdong
The Integrated Process Planning and Scheduling (IPPS) problem combines process route planning and shop scheduling to achieve high efficiency in manufacturing and maximize resource utilization, which is crucial for modern manufacturing systems. Traditional methods based on Mixed Integer Linear Programming (MILP) and heuristic algorithms cannot balance solution quality and speed well when solving IPPS. In this paper, we propose a novel end-to-end Deep Reinforcement Learning (DRL) method. We model the IPPS problem as a Markov Decision Process (MDP) and employ a Heterogeneous Graph Neural Network (GNN) to capture the complex relationships among operations, machines, and jobs. To optimize the scheduling strategy, we use Proximal Policy Optimization (PPO). Experimental results show that, compared to traditional methods, our approach significantly improves solution efficiency and quality in large-scale IPPS instances, providing superior scheduling strategies for modern intelligent manufacturing systems.
Recurrent and Convolutional Neural Networks in Classification of EEG Signal for Guided Imagery and Mental Workload Detection
Postepski, Filip, Wojcik, Grzegorz M., Wrobel, Krzysztof, Kawiak, Andrzej, Zemla, Katarzyna, Sedek, Grzegorz
The Guided Imagery technique is reported to be used by therapists all over the world to increase the comfort of patients suffering from a variety of disorders, from mental to oncological ones, and has proved successful in numerous ways. One possible form of support for therapists is an estimate of the time at which a subject enters deep relaxation. This paper presents the results of a study of a cohort of 26 students exposed to the Guided Imagery relaxation technique and to mental task workloads, recorded with a dense-array electroencephalographic amplifier. The research aimed to verify whether it is possible to detect differences between those two states and to classify them using deep learning methods, including recurrent neural networks: EEGNet, a Long Short-Term Memory-based classifier, a 1D Convolutional Neural Network, and a hybrid model combining a 1D Convolutional Neural Network with Long Short-Term Memory. The data processing pipeline is presented from data acquisition through initial data cleaning, preprocessing, and postprocessing. The classification was based on two datasets: one using 26 so-called cognitive electrodes and the other using the signal collected from 256 channels. So far there have been no such comparisons in the application being discussed. The classification results are reported using validation metrics such as accuracy, recall, precision, F1-score, and loss for each case. It turned out that it is not necessary to collect signals from all electrodes: classification based on the cognitive electrodes gives results similar to those obtained for the full signal, and extending the input to 256 channels does not add much value. The Discussion proposes an optimal classifier as well as some suggestions concerning the prospective development of the project.
Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness
Jeanselme, Vincent, De-Arteaga, Maria, Zhang, Zhe, Barrett, Jessica, Tom, Brian
Machine learning risks reinforcing biases present in data, and, as we argue in this work, in what is absent from data. In healthcare, biases have marked medical history, leading to unequal care affecting marginalised groups. Patterns in missing data often reflect these group discrepancies, but the algorithmic fairness implications of group-specific missingness are not well understood. Despite its potential impact, imputation is often an overlooked preprocessing step, with attention placed on the reduction of reconstruction error and overall performance, ignoring how imputation can affect groups differently. Our work studies how imputation choices affect reconstruction errors across groups and algorithmic fairness properties of downstream predictions.
Homogeneous Learning: Self-Attention Decentralized Deep Learning
Federated learning (FL) has been facilitating privacy-preserving deep learning in many walks of life, such as medical image classification and network intrusion detection. However, it requires a central parameter server for model aggregation, which brings about delayed model communication and vulnerability to adversarial attacks. A fully decentralized architecture like Swarm Learning allows peer-to-peer communication among distributed nodes without a central server. One of the most challenging issues in decentralized deep learning is that the data owned by each node are usually non-independent and identically distributed (non-IID), which slows the convergence of model training. To this end, we propose a decentralized learning model called Homogeneous Learning (HL) for tackling non-IID data with a self-attention mechanism. In HL, training is performed on the node selected in each round, and the trained model of that node is sent to the next selected node at the end of the round. Notably, for the selection, the self-attention mechanism leverages reinforcement learning to observe a node's inner state and the state of its surrounding environment, and to determine which node should be selected to optimize the training. We evaluate our method in various scenarios for an image classification task. The results suggest that HL achieves better performance than standalone learning and, compared with random policy-based decentralized learning on non-IID data, reduces the total training rounds by 50.8% and the communication cost by 74.6%.
Understanding Catastrophic Overfitting in Single-step Adversarial Training
Kim, Hoki, Lee, Woojin, Lee, Jaewook
Adversarial examples are perturbed inputs that are designed to deceive machine-learning classifiers by adding adversarial perturbations to the original data. Although fast adversarial training has demonstrated both robustness and efficiency, the problem of "catastrophic overfitting" has been observed. This is a phenomenon in which, during single-step adversarial training, the robust accuracy against projected gradient descent (PGD) suddenly decreases to 0% after a few epochs, whereas the robustness against the fast gradient sign method (FGSM) increases to 100%. In this paper, we address three main topics. (i) We demonstrate that catastrophic overfitting occurs in single-step adversarial training because it trains on adversarial images with maximum perturbation only, not on all adversarial examples in the adversarial direction, which leads to a distorted decision boundary and a highly curved loss surface. (ii) We experimentally prove this phenomenon by proposing a simple method using checkpoints. This method not only prevents catastrophic overfitting, but also overturns the belief that single-step adversarial training struggles to defend against multi-step attacks. (iii) We compare the performance of the proposed method to that obtained in recent works and demonstrate that it provides sufficient robustness to different attacks even after hundreds of training epochs in less time. All code for reproducing the experiments in this paper is available at https://github.com/Harry24k/catastrophic-overfitting.
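To make the single-step attack concrete, the sketch below computes an FGSM perturbation for a logistic-regression classifier and evaluates the loss at evenly spaced points along the adversarial direction. The linear model, the function names, and the evenly spaced evaluation points are all assumptions chosen for illustration; the abstract does not specify how the paper's checkpoint method uses such intermediate points, so this is not a reproduction of it.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(x @ w + b)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x."""
    return (sigmoid(x @ w + b) - y) * w


def fgsm(w, b, x, y, eps):
    """Single-step attack: move by eps along the sign of the input gradient."""
    return x + eps * np.sign(input_gradient(w, b, x, y))


def losses_along_direction(w, b, x, y, eps, num_points):
    """Loss at evenly spaced points between x and the FGSM example.

    Illustrative only: the spacing and count of points are assumptions.
    """
    direction = np.sign(input_gradient(w, b, x, y))
    return [
        loss(w, b, x + (k / num_points) * eps * direction, y)
        for k in range(1, num_points + 1)
    ]


if __name__ == "__main__":
    w, b = np.array([1.0, -2.0]), 0.0
    x, y = np.array([0.5, 0.5]), 1.0
    print(fgsm(w, b, x, y, eps=0.1))
    print(losses_along_direction(w, b, x, y, eps=0.1, num_points=4))
```

For a linear model the loss grows monotonically along this direction, so the maximum perturbation is also the worst case. The abstract's point is that for deep networks trained only on the maximum-perturbation examples this need not hold, which is why points inside the perturbation range matter.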